This document presents a preliminary analysis of the INC 5000 2019 data.

I used ChatGPT for assistance while processing the data.

The data can be downloaded from:

https://www.kaggle.com/datasets/mysarahmadbhat/inc-5000-companies

Preparing the working environment and loading the data

Loading the required libraries

We use the tidyverse package for data preparation, with ggthemes and plotly supporting the visualizations. To examine the asymmetry measures of the distributions of quantitative variables, we use the moments library.

library(tidyverse)
library(ggthemes)
library(plotly)
library(moments)

Loading the data

We load the data into the inc data frame with the read.csv function:

setwd(dir = 'D:/wd/')
inc <- read.csv(file = 'INC 5000 Companies 2019.csv', header = TRUE, sep = ',',na.strings = c("","NA"))
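The na.strings argument is what turns empty cells into NA on import. A minimal sketch of that behavior on a made-up two-row file (the temporary file and its contents are illustrative assumptions, not part of the dataset):

```r
# Write a tiny CSV with one empty metro cell to a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,metro", "Alpha,Phoenix", "Beta,"), tmp)

# With the default settings the empty cell of a text column is read as ""
raw <- read.csv(tmp, header = TRUE, sep = ",")

# With na.strings = c("", "NA") the same cell is read as NA
clean <- read.csv(tmp, header = TRUE, sep = ",", na.strings = c("", "NA"))

raw$metro
clean$metro
```

Without this argument, is.na() would not count empty text cells as missing.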

Data preparation

Initial assessment of the data structure

str(inc)
## 'data.frame':    5012 obs. of  14 variables:
##  $ rank            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ profile         : chr  "https://www.inc.com/profile/freestar" "https://www.inc.com/profile/freightwise" "https://www.inc.com/profile/ceces-veggie" "https://www.inc.com/profile/ladyboss" ...
##  $ name            : chr  "Freestar" "FreightWise" "Cece's Veggie Co." "LadyBoss" ...
##  $ url             : chr  "http://freestar.com" "http://freightwisellc.com" "http://cecesveggieco.com" "http://ladyboss.com" ...
##  $ state           : chr  "AZ" "TN" "TX" "NM" ...
##  $ revenue         : chr  "36.9 Million" "33.6 Million" "24.9 Million" "32.4 Million" ...
##  $ growth_.        : num  36680 30548 23880 21850 18166 ...
##  $ industry        : chr  "Advertising & Marketing" "Logistics & Transportation" "Food & Beverage" "Consumer Products & Services" ...
##  $ workers         : int  40 39 190 57 25 742 12 72 60 37 ...
##  $ previous_workers: int  5 8 10 2 6 18 1 1 10 5 ...
##  $ founded         : int  2015 2015 2015 2014 2014 2009 2014 2015 2008 2014 ...
##  $ yrs_on_list     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ metro           : chr  "Phoenix" "Nashville" "Austin" NA ...
##  $ city            : chr  "Phoenix" "Brentwood" "Austin" "Albuquerque" ...

A first look at the data shows that:

  • The data contain a profile field, which will not be analyzed further.

  • The name of the growth_. variable is inconsistent with the others, so the naming should be unified.

  • The revenue variable holds a number together with a unit (Million), so it needs to be examined and converted to a numeric format.

  • In addition, the rank, metro, city and state variables can be converted to factor, but only after they have been cleaned.

At this stage we remove the profile column and give the growth_. variable a new name.

Removing the profile column

inc <- select(.data = inc,-profile)

Renaming the growth_. column

# Rename by name rather than by position, so the step does not depend on column order
inc <- rename(inc, growth = growth_.)
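The trailing dot in growth_. is most likely produced on import: by default read.csv passes column headers through make.names, which replaces characters that are not valid in R names with a dot. A short sketch, assuming the raw header was growth_% (an assumption, since the header of the CSV file is not shown in this document):

```r
# Hypothetical raw header; the actual header in the CSV file is an assumption
raw_header <- "growth_%"

# make.names() is applied by read.csv when check.names = TRUE (the default)
fixed_header <- make.names(raw_header)
fixed_header
```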

Detecting duplicated rows

To look for duplicated rows we combine the sum and duplicated functions:

cat('The inc table contains',sum(duplicated(inc)),'duplicated rows')
## The inc table contains 0 duplicated rows
# Count repeated values in every column; duplicated() flags each repeat after the first occurrence
for (i in seq_along(inc)) {
  cat(names(inc)[i],
      " - duplicates:",
      sum(duplicated(inc[[i]])),
      "\n \n")
}
## rank  - duplicates: 13 
##  
## name  - duplicates: 0 
##  
## url  - duplicates: 0 
##  
## state  - duplicates: 4961 
##  
## revenue  - duplicates: 3997 
##  
## growth  - duplicates: 6 
##  
## industry  - duplicates: 4985 
##  
## workers  - duplicates: 4376 
##  
## previous_workers  - duplicates: 4569 
##  
## founded  - duplicates: 4929 
##  
## yrs_on_list  - duplicates: 4998 
##  
## metro  - duplicates: 4941 
##  
## city  - duplicates: 3454 
## 
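The same per-column count can be written without an explicit loop. A minimal sketch on a made-up data frame (toy and its values are illustrative, not taken from inc):

```r
# Toy data frame standing in for inc (values are made up)
toy <- data.frame(
  rank  = c(1, 2, 2, 4),
  state = c("AZ", "TX", "TX", "TX"),
  name  = c("A", "B", "C", "D")
)

# duplicated() flags every repeat after the first occurrence;
# summing the flags gives the number of duplicates in each column
dup_counts <- sapply(toy, function(col) sum(duplicated(col)))
dup_counts
```

The count equals the number of rows minus the number of distinct values, so a large figure such as 4961 for state reflects 51 distinct states among 5012 rows rather than a data error.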

Detecting rows with missing values

sum(complete.cases(inc))
## [1] 4198
colSums(is.na(inc))
##             rank             name              url            state 
##                0                0                0                0 
##          revenue           growth         industry          workers 
##                0                0                0                1 
## previous_workers          founded      yrs_on_list            metro 
##                0                0                0              813 
##             city 
##                0
cat('The inc table contains',nrow(inc) - sum(complete.cases(inc)),'incomplete records' )
## The inc table contains 814 incomplete records

The table contains 814 incomplete records. Missing values occur in the metro variable (813) and the workers variable (1).
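The row-level and column-level counts answer slightly different questions: they add up only when no row has missing values in more than one column, which is the case here (813 + 1 = 814). A small sketch on made-up data (toy is illustrative only):

```r
# Toy data frame with missing values in two columns (values are made up)
toy <- data.frame(
  workers = c(40, NA, 12, 57),
  metro   = c("Phoenix", "Austin", NA, NA)
)

# Missing values counted per column
na_per_column <- colSums(is.na(toy))

# Rows with at least one missing value
incomplete_rows <- sum(!complete.cases(toy))

na_per_column
incomplete_rows
```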

Analysis of individual variables

rank

rank is the variable giving the company's position in the INC ranking.

cat('The rank variable contains',sum(duplicated(inc$rank)),'duplicates')
## The rank variable contains 13 duplicates

The duplicate check shows that the variable contains repeated values, i.e. there are ties in the ranking.

The rank variable is stored as int, but it will not be treated as a numeric variable, so it is converted to factor.

inc$rank <- as.factor(inc$rank)
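One consequence of this conversion is worth keeping in mind: calling as.numeric on a factor returns the internal level codes, not the original values. A short sketch with made-up ranks, in case the numeric positions are needed again later:

```r
# Made-up ranks with a tie, stored as a factor
rank_factor <- as.factor(c(10, 20, 20, 35))

# as.numeric() on a factor gives the level codes
codes <- as.numeric(rank_factor)

# Going through character recovers the original numbers
values <- as.numeric(as.character(rank_factor))

codes
values
```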

name

name is a character variable holding the company name. According to the earlier checks, the variable has no duplicates and no missing values.

Sample values:

head(inc$name)
## [1] "Freestar"          "FreightWise"       "Cece's Veggie Co."
## [4] "LadyBoss"          "Perpay"            "Cano Health"

url

url is the variable holding the URL of the company website. The variable has no duplicates and no missing values.

Sample values:

head(inc$url)
## [1] "http://freestar.com"       "http://freightwisellc.com"
## [3] "http://cecesveggieco.com"  "http://ladyboss.com"      
## [5] "http://perpay.com"         "http://canohealth.com"

state

The state variable records the US state in which the company is located, written as a two-letter uppercase code.

The variable is converted to factor for further analysis. The code below also returns the unique values of state:

unique(inc$state)
##  [1] "AZ" "TN" "TX" "NM" "PA" "FL" "NJ" "VA" "OH" "CA" "CO" "WA" "NY" "GA" "KY"
## [16] "ID" "UT" "MT" "NC" "WI" "MI" "MD" "MN" "AL" "MA" "ME" "KS" "IL" "CT" "NV"
## [31] "ND" "NH" "MO" "IN" "DC" "NE" "WY" "SC" "LA" "PR" "OR" "IA" "OK" "AR" "DE"
## [46] "SD" "WV" "MS" "VT" "HI" "RI"
inc$state <- as.factor(inc$state)

revenue

revenue holds the company's revenue together with a unit (Million or Billion, where Billion is the US billion, i.e. one thousand million).

head(inc$revenue)
## [1] "36.9 Million"  "33.6 Million"  "24.9 Million"  "32.4 Million" 
## [5] "22.5 Million"  "271.8 Million"

The current format of the variable rules out quantitative analysis, so it is split into two helper variables, revenue_value and revenue_unit:

inc <- separate(data = inc,col = revenue,into = c("revenue_value","revenue_unit"),sep = " " )
inc$revenue_value <- as.numeric(inc$revenue_value)

Next, the variable is converted to a single unit (millions of dollars).

First, let us confirm that revenue_unit really has just two unique values:

unique(inc$revenue_unit)
## [1] "Million" "Billion"

A conditional expression (ifelse) builds a new revenue vector from the values in revenue_value and revenue_unit:

inc$revenue <- ifelse(test = inc$revenue_unit == "Billion", 
       yes = inc$revenue_value * 1000,
       no = inc$revenue_value)
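The same conversion can be wrapped in a small helper that fails loudly on a unit it does not know. This is only a sketch: to_millions and its multiplier table are hypothetical names introduced here, not part of the original analysis.

```r
# Hypothetical helper: convert strings such as "36.9 Million" to millions of dollars
to_millions <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)
  value <- as.numeric(vapply(parts, `[`, character(1), 1))
  unit  <- vapply(parts, `[`, character(1), 2)

  # Multipliers relative to one million
  multiplier <- c(Million = 1, Billion = 1000)

  # Stop instead of silently producing NA for an unexpected unit
  unknown <- setdiff(unit, names(multiplier))
  if (length(unknown) > 0) {
    stop("Unknown revenue unit: ", paste(unknown, collapse = ", "))
  }

  value * unname(multiplier[unit])
}

revenue_millions <- to_millions(c("36.9 Million", "1.2 Billion"))
revenue_millions
```

Compared with ifelse, the lookup makes it easy to add another unit later and guards against rows that would otherwise be treated as millions by default.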

Finally, the helper variables are removed:

inc <- select(.data = inc,rank,name,url,state,revenue,growth,industry,
              workers,previous_workers,founded,yrs_on_list,metro,city)
str(inc)
## 'data.frame':    5012 obs. of  13 variables:
##  $ rank            : Factor w/ 4999 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ name            : chr  "Freestar" "FreightWise" "Cece's Veggie Co." "LadyBoss" ...
##  $ url             : chr  "http://freestar.com" "http://freightwisellc.com" "http://cecesveggieco.com" "http://ladyboss.com" ...
##  $ state           : Factor w/ 51 levels "AL","AR","AZ",..: 3 43 44 32 38 9 31 46 35 4 ...
##  $ revenue         : num  36.9 33.6 24.9 32.4 22.5 ...
##  $ growth          : num  36680 30548 23880 21850 18166 ...
##  $ industry        : chr  "Advertising & Marketing" "Logistics & Transportation" "Food & Beverage" "Consumer Products & Services" ...
##  $ workers         : int  40 39 190 57 25 742 12 72 60 37 ...
##  $ previous_workers: int  5 8 10 2 6 18 1 1 10 5 ...
##  $ founded         : int  2015 2015 2015 2014 2014 2009 2014 2015 2008 2014 ...
##  $ yrs_on_list     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ metro           : chr  "Phoenix" "Nashville" "Austin" NA ...
##  $ city            : chr  "Phoenix" "Brentwood" "Austin" "Albuquerque" ...

Let us now look at the descriptive statistics of the revenue variable:

summary(inc$revenue)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     2.00     4.80    10.40    47.47    27.00 21400.00
cat('\n Standard deviation of revenue:',sd(inc$revenue))
## 
##  Standard deviation of revenue: 391.3343
cat('\n Skewness of the revenue distribution:',skewness(inc$revenue))
## 
##  Skewness of the revenue distribution: 39.35647
cat('\n Kurtosis of the revenue distribution:',kurtosis(inc$revenue))
## 
##  Kurtosis of the revenue distribution: 1931.193

The distribution is strongly right-skewed and leptokurtic: observations take extreme values far more often than under a normal distribution.

We check normality with the Kolmogorov-Smirnov test:
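For reference, both measures can be computed by hand from central moments. The sketch below uses base R only and assumes the moment-based definitions (skewness as m3 / m2^(3/2), kurtosis as m4 / m2^2, so that a normal distribution has kurtosis 3); to my understanding these match what the moments package reports, but that is an assumption. The helper names and the sample vectors are made up.

```r
# k-th central moment with divisor n
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness and (non-excess) kurtosis
skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^(3 / 2)
kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

symmetric    <- c(1, 2, 3, 4, 5)    # made-up symmetric sample
right_tailed <- c(1, 2, 3, 4, 100)  # one large value pulls the tail to the right

skew(symmetric)
skew(right_tailed)
kurt(symmetric)
kurt(right_tailed)
```

A single extreme observation is enough to push both measures up sharply, which is the pattern seen for revenue, where the maximum is far above the third quartile.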

ks.test(x = inc$revenue,y = "pnorm", mean = mean(inc$revenue), sd = sd(inc$revenue))
## Warning in ks.test.default(x = inc$revenue, y = "pnorm", mean =
## mean(inc$revenue), : ties should not be present for the Kolmogorov-Smirnov
## test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  inc$revenue
## D = 0.45375, p-value < 2.2e-16
## alternative hypothesis: two-sided

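The distribution of revenue departs from the normal distribution (p-value < 2.2e-16), so the hypothesis of normality is rejected. Because the mean and standard deviation are estimated from the same sample and the data contain ties, the p-value is only approximate, but with D = 0.45 the conclusion is not in doubt.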
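The D statistic reported by the test is the largest vertical distance between the empirical distribution function of the sample and the reference normal distribution function. A minimal sketch on simulated data (the exponential sample and the seed are arbitrary assumptions, chosen only to obtain a skewed sample without ties):

```r
set.seed(1)
x <- rexp(200)  # simulated right-skewed sample, not the revenue data
n <- length(x)

# Reference normal CDF evaluated at the sorted observations
cdf_at_x <- pnorm(sort(x), mean = mean(x), sd = sd(x))

# The empirical CDF jumps from (i - 1) / n to i / n at the i-th sorted value,
# so the distance has to be checked on both sides of each jump
d_plus  <- max(seq_len(n) / n - cdf_at_x)
d_minus <- max(cdf_at_x - (seq_len(n) - 1) / n)
d_manual <- max(d_plus, d_minus)

# Same statistic as computed by ks.test
d_builtin <- unname(ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$statistic)

d_manual
d_builtin
```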

growth

The growth variable gives the percentage growth over the last three years, according to the information published at:

https://www.inc.com/inc5000/2019/top-private-companies-2019-inc5000.html

In the original INC 5000 list the values are given as percentages with three decimal places (see: https://www.inc.com/inc5000/2019).

The dataset provided on Kaggle stores these values incorrectly as five-digit integers, so a correction is needed:

inc$three_years_growth_percent <- inc$growth/1000
inc$three_years_growth_percent <- round(inc$three_years_growth_percent,digits = 3)
inc <- select(.data = inc,rank,name,url,state,revenue,three_years_growth_percent,
              industry,workers,previous_workers,founded,yrs_on_list,metro,city)

industry

industry is the variable describing the sector in which the company operates.

Below are the unique values of the variable and the number of occurrences of each.

inc$industry <- as.factor(inc$industry)
freq_table <- table(inc$industry)
sorted_freq_table <- sort(freq_table, decreasing = TRUE)
print(sorted_freq_table)
## 
## Business Products & Services      Advertising & Marketing 
##                          492                          489 
##                     Software                       Health 
##                          461                          356 
##                 Construction Consumer Products & Services 
##                          350                          315 
##                IT Management           Financial Services 
##                          276                          239 
##          Government Services                  Real Estate 
##                          236                          198 
##   Logistics & Transportation                Manufacturing 
##                          186                          181 
##                       Retail              Human Resources 
##                          163                          157 
##              Food & Beverage        IT System Development 
##                          127                          120 
##                  Engineering           Telecommunications 
##                           81                           79 
##                       Energy                    Education 
##                           78                           70 
##                    Insurance                     Security 
##                           70                           67 
##         Travel & Hospitality                        Media 
##                           57                           46 
##       Environmental Services                  IT Services 
##                           43                           43 
##            Computer Hardware 
##                           32
allindustries <- as.data.frame(sorted_freq_table)
colnames(allindustries) <- c("Industry","Frequency")

industry_plot <- ggplot(allindustries,aes(x =Industry,y = Frequency,fill  = Frequency)) +
  geom_bar(stat = "identity") + theme_minimal() +
  scale_fill_gradient(low = "darkseagreen1",high = "darkseagreen4") +
  theme(axis.title= element_blank(),
        legend.position = 'none',
        axis.text.x = element_text(size = 8, angle = 30, hjust = 1 )) +
  labs(title = 'Frequency of the industry variable')

ggplotly(industry_plot,tooltip = c("x","y"))
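The bars appear in descending order because the data frame built from a sorted table keeps the table's order as the factor level order, and ggplot2 draws a discrete axis in level order. A small sketch of that step with made-up industries (base R only, no plotting):

```r
# Made-up industry labels
industry <- c("Software", "Health", "Software", "Retail", "Software", "Health")

# Sort the frequency table, then turn it into a data frame
sorted_counts <- sort(table(industry), decreasing = TRUE)
counts_df <- as.data.frame(sorted_counts)
colnames(counts_df) <- c("Industry", "Frequency")

# The factor levels follow the sorted order, not the alphabetical one
levels(counts_df$Industry)
counts_df
```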

The variable is converted to factor (this repeats the conversion already applied before tabulating, so it changes nothing):

inc$industry <- as.factor(inc$industry)

workers

workers is the variable giving the number of employees.

summary(inc$workers)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      0.0     22.0     48.0    242.8    116.0 155000.0        1

The summary shows that the list includes companies with 0 employees.

subset(inc,workers == 0)
##      rank                                  name                         url
## 246   245                           Tier4 Group       http://tier4group.com
## 1974 1970              Synapse Business Systems         synapsebsystems.com
## 4568 4556 IDS International Government Services        idsinternational.com
## 4943 4931             Green Mountain Technology greenmountaintechnology.com
##      state revenue three_years_growth_percent                     industry
## 246     GA     4.6                      1.729 Business Products & Services
## 1974    VA     3.0                      0.204                IT Management
## 4568    VA    44.7                      0.064          Government Services
## 4943    TN    21.6                      0.054   Logistics & Transportation
##      workers previous_workers founded yrs_on_list          metro      city
## 246        0                3    2010           1        Atlanta   Atlanta
## 1974       0                4    2013           1 Washington, DC   Fairfax
## 4568       0              682    2006           3 Washington, DC Arlington
## 4943       0               48    1999           3           <NA>   Memphis

So that these values do not distort later analyses, we set them to NA, which excludes these records from further analysis of the workers variable:

inc[c(246,1974,4568,4943),"workers"] <- NA
inc[c(246,1974,4568,4943),"workers"]
## [1] NA NA NA NA
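The affected rows can also be selected by condition instead of by hard-coded row numbers, which keeps the step valid if the row order changes. A sketch on a made-up vector that already contains one NA:

```r
# Made-up employee counts with two zeros and one existing NA
workers <- c(40, 0, NA, 57, 0)

# which() keeps only positions where the comparison is TRUE,
# so the existing NA is not selected
zero_rows <- which(workers == 0)
workers[zero_rows] <- NA

zero_rows
mean(workers, na.rm = TRUE)
```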

previous_workers

previous_workers is the variable giving the number of employees in the previous period. Because that period is not specified, the column is removed from the data.

inc <- select(inc,-previous_workers)

founded

The founded variable holds the year in which the company was founded.

summary(inc$founded)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2003    2009    2005    2012    2016
sort(unique(inc$founded))
##  [1]    0 1869 1884 1895 1897 1899 1902 1909 1910 1914 1917 1923 1925 1927 1928
## [16] 1929 1932 1939 1941 1945 1946 1948 1949 1951 1953 1955 1956 1957 1959 1961
## [31] 1962 1963 1964 1965 1967 1968 1969 1970 1972 1973 1974 1975 1976 1977 1978
## [46] 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## [61] 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## [76] 2009 2010 2011 2012 2013 2014 2015 2016

The summary shows that errors have crept into the data: a founding year of 0 appears on the INC 2019 list. The data therefore need a closer look:

subset(inc,founded == 0)
##      rank                  name                            url state revenue
## 4726 4714 Nassau National Cable http://nassaunationalcable.com    NY      11
##      three_years_growth_percent                     industry workers founded
## 4726                       0.06 Business Products & Services      30       0
##      yrs_on_list         metro       city
## 4726           6 New York City GREAT NECK
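Such entries can also be flagged with a general range check rather than by looking for one specific value. A sketch on made-up years; the lower bound of 1800 is an arbitrary assumption and list_year is a name introduced here for illustration:

```r
# Made-up founding years, including two implausible entries
founded <- c(2015, 0, 1869, 2016, 2023)

list_year <- 2019  # year of the ranking

# Flag years before an assumed lower bound or after the list year
implausible <- which(founded < 1800 | founded > list_year)
implausible
```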

After a quick online check of the history of Nassau National Cable, I found that 1950 can be taken as its founding year.

inc[4726,'founded'] <- 1950
freq_founded <- as.data.frame(table(inc$founded))
colnames(freq_founded) <- c('Founded','Freq')
freq_founded[order(freq_founded$Freq, decreasing = TRUE),]
##    Founded Freq
## 81    2014  466
## 79    2012  440
## 80    2013  431
## 78    2011  377
## 76    2009  363
## 77    2010  360
## 75    2008  299
## 74    2007  255
## 73    2006  198
## 71    2004  188
## 72    2005  183
## 82    2015  172
## 70    2003  161
## 68    2001  133
## 69    2002  131
## 67    2000   95
## 66    1999   93
## 64    1997   68
## 63    1996   63
## 65    1998   53
## 62    1995   47
## 61    1994   36
## 60    1993   30
## 56    1989   27
## 57    1990   24
## 59    1992   24
## 58    1991   23
## 52    1985   22
## 54    1987   22
## 49    1982   20
## 55    1988   19
## 53    1986   17
## 45    1978   11
## 50    1983   11
## 51    1984   11
## 43    1976   10
## 46    1979   10
## 47    1980   10
## 48    1981    9
## 37    1969    7
## 40    1973    7
## 44    1977    6
## 20    1946    5
## 34    1965    5
## 41    1974    5
## 27    1956    4
## 38    1970    4
## 39    1972    4
## 42    1975    4
## 22    1949    3
## 25    1953    3
## 29    1959    3
## 31    1962    3
## 12    1925    2
## 14    1928    2
## 18    1941    2
## 19    1945    2
## 26    1955    2
## 33    1964    2
## 35    1967    2
## 1     1869    1
## 2     1884    1
## 3     1895    1
## 4     1897    1
## 5     1899    1
## 6     1902    1
## 7     1909    1
## 8     1910    1
## 9     1914    1
## 10    1917    1
## 11    1923    1
## 13    1927    1
## 15    1929    1
## 16    1932    1
## 17    1939    1
## 21    1948    1
## 23    1950    1
## 24    1951    1
## 28    1957    1
## 30    1961    1
## 32    1963    1
## 36    1968    1
## 83    2016    1

yrs_on_list

The yrs_on_list variable shows for how many years the company has appeared in the INC 5000 ranking.

summary(inc$yrs_on_list)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.814   4.000  14.000
inc$yrs_on_list <- as.factor(inc$yrs_on_list)

The variable takes values in the range 1 to 14.

The chart below shows how many companies have spent a given number of years on the list:

freq_yrs_on_list <-  as.data.frame(table(inc$yrs_on_list))
colnames(freq_yrs_on_list) <- c('years_on_list', 'frequency')
yrs_on_list_plot <- ggplot(freq_yrs_on_list, aes(x = years_on_list, y = frequency)) +
  geom_col(fill = 'darkseagreen4', color = 'black') + 
  theme_minimal() + labs(title = 'Frequency of the yrs_on_list variable',
                         x = 'Years on the list',
                         y = 'Frequency')

ggplotly(yrs_on_list_plot)

metro

The metro variable denotes the metropolitan area. Let us check whether any metropolitan areas are repeated (e.g. because of differences in spelling):

sort(unique(inc$metro))
##  [1] "Albany-Schenectady-Troy, NY"         
##  [2] "Allentown-Bethlehem-Easton, PA-NJ"   
##  [3] "Ann Arbor, MI"                       
##  [4] "Asheville, NC"                       
##  [5] "Atlanta"                             
##  [6] "Austin"                              
##  [7] "Baltimore"                           
##  [8] "Baton Rouge, LA"                     
##  [9] "Birmingham, AL"                      
## [10] "Boise City-Nampa, ID"                
## [11] "Boston"                              
## [12] "Boulder, CO"                         
## [13] "Bridgeport-Stamford-Norwalk, CT"     
## [14] "Charleston, SC"                      
## [15] "Charlotte"                           
## [16] "Chicago"                             
## [17] "Cincinnati"                          
## [18] "Cleveland"                           
## [19] "Columbia, SC"                        
## [20] "Columbus, OH"                        
## [21] "Dallas"                              
## [22] "Denver"                              
## [23] "Des Moines, IA"                      
## [24] "Detroit"                             
## [25] "Durham, NC"                          
## [26] "Greenville-Mauldin-Easley, SC"       
## [27] "Houston"                             
## [28] "Huntsville, AL"                      
## [29] "Indianapolis, IN"                    
## [30] "Inland Empire, CA"                   
## [31] "Jacksonville, FL"                    
## [32] "Kansas City, MO-KS"                  
## [33] "Lancaster, PA"                       
## [34] "Las Vegas, NV"                       
## [35] "Los Angeles"                         
## [36] "Louisville/Jefferson County, KY-IN"  
## [37] "Madison, WI"                         
## [38] "Miami"                               
## [39] "Milwaukee"                           
## [40] "Minneapolis"                         
## [41] "Nashville"                           
## [42] "New Orleans"                         
## [43] "New York City"                       
## [44] "Ogden-Clearfield, UT"                
## [45] "Oklahoma City, OK"                   
## [46] "Omaha-Council Bluffs, NE-IA"         
## [47] "Orlando, FL"                         
## [48] "Oxnard-Thousand Oaks-Ventura, CA"    
## [49] "Philadelphia"                        
## [50] "Phoenix"                             
## [51] "Pittsburgh, PA"                      
## [52] "Provo-Orem, UT"                      
## [53] "Raleigh, NC"                         
## [54] "Richmond, VA"                        
## [55] "Rochester, NY"                       
## [56] "Sacramento, CA"                      
## [57] "Salt Lake City"                      
## [58] "San Antonio, TX"                     
## [59] "San Diego"                           
## [60] "San Francisco"                       
## [61] "San Jose"                            
## [62] "Santa Barbara-Santa Maria-Goleta, CA"
## [63] "Sarasota-Bradenton-Venice, FL"       
## [64] "Seattle"                             
## [65] "Springfield, MO"                     
## [66] "St. Louis, MO-IL"                    
## [67] "Tampa"                               
## [68] "Tulsa, OK"                           
## [69] "Virginia Beach"                      
## [70] "Washington, DC"

Some metropolitan areas have a state abbreviation appended after a comma. It is not needed in further analysis because it is already contained in the state variable, so it is removed from metro.

inc$metro <- sub(',.*','',inc$metro)
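The pattern ',.*' matches the first comma and everything after it, so sub removes the whole suffix and leaves values without a comma unchanged. A short sketch on a few values taken from the listing above, plus an NA:

```r
metro <- c("Kansas City, MO-KS", "Atlanta", "Washington, DC", NA)

# Drop the first comma and everything that follows it
metro_clean <- sub(",.*", "", metro)
metro_clean
```

Missing values stay NA, and "Washington, DC" becomes "Washington", which is worth remembering when reading later summaries by metro.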

city

The city variable holds the city.

Let us check its unique values to catch any repeats caused by differences in spelling:
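Differences in letter case and stray punctuation are a common source of such repeats. A minimal sketch of one possible normalization step; clean_city is a hypothetical helper introduced here, not part of the original analysis, and it does not correct genuine misspellings:

```r
# Hypothetical helper: trim stray punctuation and unify letter case
clean_city <- function(x) {
  # Remove punctuation and whitespace at the start and end of the value
  x <- gsub("^[[:punct:][:space:]]+|[[:punct:][:space:]]+$", "", x)
  # Lower-case everything, then capitalize the first letter of each word
  x <- tolower(x)
  gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)
}

cleaned <- clean_city(c(":Livermore", "ANAHEIM", "bay shore", "Chandler,"))
cleaned
```

This simple rule also flattens names with internal capitals (for example a name written like "McLean" would become "Mclean"), and typos need a separate manual mapping, so the result should be reviewed before the variable is converted to factor.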

sort(unique(inc$city))
##    [1] ":Livermore"            "Aberdeen"              "Abilene"              
##    [4] "Acton"                 "Ada"                   "Addison"              
##    [7] "Agoura Hills"          "Ahaheim"               "Akron"                
##   [10] "Alameda"               "Albany"                "ALBANY"               
##   [13] "Albuquerque"           "ALBUQUERQUE"           "Alexandria"           
##   [16] "Aliso Viejo"           "Allegan"               "Allen"                
##   [19] "ALLEN"                 "Allendale"             "Allentown"            
##   [22] "Alpena"                "Alpharetta"            "ALPHARETTA"           
##   [25] "Altamonte Springs"     "Alton"                 "Amarillo"             
##   [28] "Ambler"                "Ambridge"              "American Fork"        
##   [31] "Amherst"               "anaheim"               "Anaheim"              
##   [34] "ANAHEIM"               "Andover"               "Ann Arbor"            
##   [37] "Annandale"             "Annapolis"             "Annapolis Junction"   
##   [40] "Apex"                  "Apollo Beach"          "Arbutus"              
##   [43] "Argyle"                "Arlignton"             "Arlington"            
##   [46] "Arlington Heights"     "Arvada"                "Arvin"                
##   [49] "Asbury Park"           "Ashburn"               "ASHBURN"              
##   [52] "Asheville"             "Ashland"               "athens"               
##   [55] "Athens"                "Atlanta"               "Auburn"               
##   [58] "Auburn Hills"          "Auburndale"            "Augusta"              
##   [61] "Aurora"                "AURORA"                "Austell"              
##   [64] "Austin"                "Aventura"              "Avon"                 
##   [67] "Avondale"              "Bakersfield"           "Bala Cynwyd"          
##   [70] "Ballwin"               "Baltimore"             "Bannockburn"          
##   [73] "Baraboo"               "Barberton"             "Barnesville"          
##   [76] "Bartlett"              "Bartonville"           "Batavia"              
##   [79] "Baton Rouge"           "Battlefield"           "bay shore"            
##   [82] "Bay Shore"             "Bayside"               "Beaufort"             
##   [85] "Beavercreek"           "Beaverton"             "Bedford"              
##   [88] "Bedford Hts"           "Bedminster"            "Bee Cave"             
##   [91] "Bee Caves"             "Bell Canyon"           "Belleville"           
##   [94] "BELLEVILLE"            "Bellevue"              "Bellingham"           
##   [97] "Bellmawr"              "Belmont"               "Beltsville"           
##  [100] "Bend"                  "BEND"                  "Bennington"           
##  [103] "Bensalem"              "Berkeley"              "Berkeley Springs"     
##  [106] "Berwyn"                "Bessemer"              "Bethel"               
##  [109] "Bethesda"              "Bethlehem"             "Beverly"              
##  [112] "Beverly Hills"         "Billings"              "bingham farms"        
##  [115] "Birmimgham"            "Birmingham"            "Bismarck"             
##  [118] "Blaine"                "Blasdell"              "Blauvelt"             
##  [121] "BLOOMFIELD"            "Bloomfield Hills"      "Bloomingdale"         
##  [124] "Bloomington"           "Bloomsburg"            "Blue Ash"             
##  [127] "Blue Bell"             "Blue Mounds"           "Blue Springs"         
##  [130] "Bluffdale"             "Bluffton"              "Boca Raton"           
##  [133] "Bohemia"               "Boise"                 "Bolingbrook"          
##  [136] "Bon Aqua"              "Boothwyn"              "Boston"               
##  [139] "BOSTON"                "Bothell"               "Boulder"              
##  [142] "Bountiful"             "Bowie"                 "Boynton Beach"        
##  [145] "Bozeman"               "Bradenton"             "BRADENTON"            
##  [148] "Braintree"             "BRAINTREE"             "Branchburg"           
##  [151] "Brandon"               "Brea"                  "Breckenridge"         
##  [154] "Brecksville"           "Brentwood"             "Brick"                
##  [157] "Bridgeport"            "Brisbane"              "Brockton"             
##  [160] "Bronx"                 "Brookfield"            "Brookhaven"           
##  [163] "Brooklyn"              "Brooklyn Park"         "Brookshire"           
##  [166] "Brookston"             "Broomfield"            "Brossard"             
##  [169] "Brunswick"             "Buellton"              "Buena Park"           
##  [172] "Buffalo"               "Buffalo Grove"         "Buford"               
##  [175] "BUFORD"                "Buhler"                "Burbank"              
##  [178] "BURBANK"               "Burlingame"            "Burlington"           
##  [181] "Burnsville"            "Burr Ridge"            "Burtonsville"         
##  [184] "Butler"                "CA"                    "Calabasas"            
##  [187] "CALHOUN"               "California"            "Camarillo"            
##  [190] "Camas"                 "Camden"                "Campbell"             
##  [193] "Campbell Hall"         "Campbellsport"         "Canoga Park"          
##  [196] "Canton"                "Cape Canaveral"        "Carbon Hill"          
##  [199] "Caribou"               "Carlsbad"              "Carmel"               
##  [202] "CARMEL"                "Carrollton"            "Carson"               
##  [205] "Cary"                  "Casselberry"           "Castle Pines"         
##  [208] "Castle Rock"           "Catlett"               "Cedar Falls"          
##  [211] "Cedar Park"            "CEDAR PARK"            "Cedar Rapids"         
##  [214] "Centennial"            "CENTENNIAL"            "Center Point"         
##  [217] "Centerville"           "Cerritos"              "Chadds Ford"          
##  [220] "Chambersburg"          "Chandler"              "Chandler,"            
##  [223] "Chantilly"             "CHANTILLY"             "Chapel Hill"          
##  [226] "Charleston"            "Charlotte"             "Charlottesville"      
##  [229] "chatsworth"            "Chatsworth"            "Chattanooga"          
##  [232] "CHATTANOOGA"           "Chelmsford"            "Cher"                 
##  [235] "Cherry Hill"           "Chesapeake"            "Chester"              
##  [238] "Chesterfield"          "Chesterton"            "Chestertown"          
##  [241] "Chevy Chase"           "Cheyenne"              "Chicago"              
##  [244] "Chicago Heights"       "Chicao"                "Chico"                
##  [247] "Chino"                 "Chino Hills"           "Chula Vista"          
##  [250] "Cincinnati"            "CINCINNATI"            "Cincinnati, OH"       
##  [253] "Cincinnnati"           "City of Industry"      "Clarksburg"           
##  [256] "Claymont"              "Clayton"               "clearwater"           
##  [259] "Clearwater"            "CLEARWATER"            "Cleveland"            
##  [262] "Clifton Park"          "Clinton Township"      "Clive"                
##  [265] "Clovis"                "Cockeysville"          "Coconut Creek"        
##  [268] "Coconut Grove"         "Coeur D Alene"         "College Grove"        
##  [271] "College Station"       "COLLEGEVILLE"          "Colleyville"          
##  [274] "Collierville"          "Colonial Beach"        "Colonial Heights"     
##  [277] "COLONIAL HEIGHTS"      "Colorad Springs"       "colorado springs"     
##  [280] "Colorado Springs"      "Columbia"              "Columbus"             
##  [283] "Comfort"               "Commerce"              "Commerce Twp"         
##  [286] "concord"               "Concord"               "Conroe"               
##  [289] "Conshohocken"          "Conway"                "Copley"               
##  [292] "Coppell"               "Coral Gables"          "Coral Springs"        
##  [295] "CORAL SPRINGS"         "Coralville"            "Corona"               
##  [298] "Corpus Christi"        "Cortland"              "Costa Mesa"           
##  [301] "Coto de Caza"          "COTTONWD HTS"          "Covingtom"            
##  [304] "Covington"             "Cranbury"              "Crestview"            
##  [307] "Crestwood"             "Crystal"               "Crystal Lake"         
##  [310] "Culver City"           "Cumming"               "Cypress"              
##  [313] "Dahlonega"             "Dakota Dunes"          "Dallas"               
##  [316] "Danbury"               "Dane"                  "Dania Beach"          
##  [319] "Danvers"               "DANVERS"               "Danville"             
##  [322] "Daphne"                "DAPHNE"                "Darien"               
##  [325] "Davidson"              "Davidsonville"         "Davie"                
##  [328] "DAVIS"                 "Dayton"                "De Pere"              
##  [331] "Decatur"               "Deer Park"             "DEER PARK"            
##  [334] "Deerfield"             "Deerfield Beach"       "Defiance"             
##  [337] "Del City"              "Del Mar"               "Delafield"            
##  [340] "Delavan"               "delray beach"          "Delray Beach"         
##  [343] "Denton"                "Denver"                "Derwood"              
##  [346] "Des Moines"            "Des Plaines"           "Destin"               
##  [349] "Detroit"               "DETROIT LAKES"         "Dexter"               
##  [352] "Diamond Bar"           "doral"                 "Doral"                
##  [355] "Dover"                 "Downers Grove"         "Downingtown"          
##  [358] "Doylestown"            "Draper"                "DRAPER"               
##  [361] "Drexel"                "Drexel Hill"           "Dripping Springs"     
##  [364] "Duarte"                "Dublin"                "Dubuque"              
##  [367] "Dulles"                "Duluth"                "Dumfries"             
##  [370] "Dunbar"                "Dundee"                "Dunwoody"             
##  [373] "Durango"               "durham"                "Durham"               
##  [376] "DURHAM"                "eagan"                 "Eagan"                
##  [379] "Eagle"                 "Eagle Mountain"        "East Boston"          
##  [382] "East Brunswick"        "East Flat Rock"        "East Hampton"         
##  [385] "East Providence"       "Eden"                  "Eden Prairie"         
##  [388] "Eden Prarie"           "Edgewater"             "Edgewood"             
##  [391] "Edina"                 "Edison"                "Edmond"               
##  [394] "EDMOND"                "Effingham"             "Eighty Four"          
##  [397] "El Cajon"              "El Dorado Hills"       "EL PASO"              
##  [400] "El Segundo"            "ELDERSBURG"            "Elgin"                
##  [403] "Elk city"              "Elk Grove Village"     "Elkhart"              
##  [406] "Elkridge"              "ELLICOTT CITY"         "Elm Grove"            
##  [409] "elmhurst"              "Elmhurst"              "Elmsford"             
##  [412] "Elmwood Park"          "Ely"                   "Emerald Isle"         
##  [415] "Emeryville"            "Emmaus"                "EMMAUS"               
##  [418] "Encinitas"             "Encinitass"            "Encino"               
##  [421] "Englewood"             "Erie"                  "Erlanger"             
##  [424] "Escondido"             "Eugene"                "Euless"               
##  [427] "Evans"                 "Evanston"              "Evergreen"            
##  [430] "Ewa Beach"             "EWING"                 "Exeter"               
##  [433] "Exton"                 "Fairfax"               "Fairfield"            
##  [436] "Fairhope"              "Fairlawn"              "Fall River"           
##  [439] "falls church"          "Falls Church"          "FALLS CHURCH"         
##  [442] "Fargo"                 "Farmers Branch"        "Farmington"           
##  [445] "Farmington Hills"      "fayetteville"          "Fayetteville"         
##  [448] "FAYETTEVILLE"          "FELTON"                "Fenton"               
##  [451] "Ferndale"              "Fishers"               "Fletcher"             
##  [454] "Flint"                 "Florida"               "Flowood"              
##  [457] "Flushing"              "Folsom"                "Fontana"              
##  [460] "Foothill Ranch"        "Forest Hill"           "Forked River"         
##  [463] "Fort Collins"          "Fort Lauderdale"       "Fort Lee"             
##  [466] "Fort Mill"             "Fort Myers"            "Fort Pierce"          
##  [469] "Fort Smith"            "Fort Walton Beach"     "Fort Washington"      
##  [472] "Fort Wayne"            "Fort Worth"            "Foster City"          
##  [475] "Fountain Valley"       "Fox Lake"              "Foxborough"           
##  [478] "Framingham"            "FRANKFORD"             "Franklin"             
##  [481] "Frederick"             "Fredericksburg"        "Freeburg"             
##  [484] "FREEHOLD"              "Fremont"               "Fresh Meadows"        
##  [487] "fresno"                "Fresno"                "FRESNO"               
##  [490] "Frisco"                "Frontenac"             "Fruita"               
##  [493] "Ft Collins"            "Ft Worth"              "Ft. Lauderdale"       
##  [496] "fullerton"             "Fullerton"             "Fulton"               
##  [499] "Fuquay-Varina"         "Gainesville"           "Gaithersburg"         
##  [502] "Garden City"           "Garden Grove"          "GARLAND"              
##  [505] "Garnet Valley"         "Gastonia"              "Geneseo"              
##  [508] "Genoa City"            "Georgetown"            "Germantown"           
##  [511] "Gibson"                "Giddings"              "Gig Harbor"           
##  [514] "Gilbert"               "Gillette"              "Glen Allen"           
##  [517] "Glen Burnie"           "Glendale"              "GLENDALE"             
##  [520] "Glendale Heights"      "Glenview Nas"          "Gold River"           
##  [523] "Golden"                "GOLDEN VALLEY"         "Goleta"               
##  [526] "Grafton"               "Grain Valley"          "Grand Junction"       
##  [529] "Grand Prairie"         "Grand Rapids"          "GRAND RAPIDS"         
##  [532] "Grandview"             "Granite Bay"           "Granite Falls"        
##  [535] "Grants Pass"           "Granville"             "Grapevine"            
##  [538] "Grass Valley"          "GREAT NECK"            "Green Bay"            
##  [541] "GREEN BROOK"           "Greenacres"            "Greenbelt"            
##  [544] "Greenland"             "Greensboro"            "Greenville"           
##  [547] "GREENVILLE"            "Greenwich"             "Greenwood"            
##  [550] "Greenwood Village"     "Greer"                 "Grover Beach"         
##  [553] "Guilderland Center"    "hacienda heights"      "Hackensack"           
##  [556] "Hackettstown"          "Haddonfield"           "Hainesport"           
##  [559] "Halfmoon"              "Haltom city"           "Haltom City"          
##  [562] "Hamburg"               "Hamden"                "Hamilton"             
##  [565] "HAMILTON"              "Hampton"               "Hanahan"              
##  [568] "Hanover"               "Harrisburg"            "Harrisonburg"         
##  [571] "Hartford"              "Hartselle"             "Hauppauge"            
##  [574] "HAUPPAUGE"             "Haverhill"             "Havre de Grace"       
##  [577] "Hayden"                "Hayward"               "Heath"                
##  [580] "Henderson"             "Hendersonville"        "Henrico"              
##  [583] "Hermosa Beach"         "Herndon"               "Hialeah"              
##  [586] "Hiawatha"              "Hickory"               "Hicksville"           
##  [589] "High Bridge"           "High Point"            "highland"             
##  [592] "Highland Park"         "Highlands Ranch"       "Hilliard"             
##  [595] "Hillsboro"             "Hillsborough"          "HILLSIDE"             
##  [598] "Hingham"               "Hinsdale"              "hixson"               
##  [601] "Hixson"                "Hoboken"               "HOBOKEN"              
##  [604] "Hoffman Estates"       "Holladay"              "Holland"              
##  [607] "Holly Springs"         "Hollywood"             "Holmdel"              
##  [610] "HOLMEN"                "Holmes"                "Honesdale"            
##  [613] "Honolulu"              "Hoover"                "Hope Mills"           
##  [616] "Hopkinton"             "Hot Springs"           "Houston"              
##  [619] "Howell"                "HOWELL"                "Hudson"               
##  [622] "Hudson Oaks"           "Hudsonville"           "Humble"               
##  [625] "Hunt Valley"           "Huntersville"          "Huntington Beach"     
##  [628] "HUNTINGTON BEACH"      "Huntsville"            "HUNTSVILLE"           
##  [631] "Hyattsville"           "Idaho Falls"           "Ijamsville"           
##  [634] "Independence"          "Indian Trail"          "Indianapolis"         
##  [637] "Indio"                 "Iowa City"             "Irvine"               
##  [640] "IRVINE"                "Irving"                "IRVING"               
##  [643] "Irving Texas"          "Iselin"                "Islandia"             
##  [646] "ISSAQUAH"              "Itasca"                "Jackson"              
##  [649] "Jacksonville"          "Jacksonville Beach"    "JACKSONVILLE BEACH"   
##  [652] "Jamul"                 "Jenks"                 "Jericho"              
##  [655] "Jersey City"           "Johns Creek"           "Johnston"             
##  [658] "Jordan"                "Jupiter"               "Kalispell"            
##  [661] "Kansas City"           "Kasyville"             "Katy"                 
##  [664] "Kaukauna"              "Kaysville"             "Kearny"               
##  [667] "Keller"                "kenilworth"            "Kenilworth"           
##  [670] "kennesaw"              "Kennesaw"              "kenosha"              
##  [673] "Kenosha"               "kent"                  "Kent"                 
##  [676] "Kentwood"              "Kernersville"          "Key West"             
##  [679] "Killeen"               "King"                  "King of Prussia"      
##  [682] "Kingman"               "Kingston"              "Kingwood"             
##  [685] "Kirkland"              "KIRKLAND"              "Kitty Hawk"           
##  [688] "Knightstown"           "Knoxville"             "Kodak"                
##  [691] "Kutztown"              "La Crescenta"          "La Grange"            
##  [694] "La Jolla"              "La Mirada"             "La Porte"             
##  [697] "La Quinta"             "Lacey"                 "Laconia"              
##  [700] "lafayette"             "Lafayette"             "Lagrange"             
##  [703] "LaGrange"              "Laguna Hills"          "Laguna Niguel"        
##  [706] "Lahaina"               "Lake Balboa"           "Lake City"            
##  [709] "LAKE CITY"             "Lake Elsinore"         "Lake Forest"          
##  [712] "Lake Havasu City"      "Lake in the Hills"     "Lake Mary"            
##  [715] "Lake Orion"            "Lake Oswego"           "Lake Success"         
##  [718] "Lakeland"              "Lakeville"             "Lakeway"              
##  [721] "Lakewood"              "LAKEWOOD"              "Lakewood Ranch"       
##  [724] "Lambertville"          "Lancaster"             "Land O' Lakes"        
##  [727] "Landover"              "Lansdale"              "Lansing"              
##  [730] "Larchmont"             "Largo"                 "LARGO"                
##  [733] "Las Vegas"             "LaSalle"               "Latham"               
##  [736] "Laurie"                "Lawndale"              "Lawrenceville"        
##  [739] "Lawton"                "Layton"                "Laytonsville"         
##  [742] "League City"           "Leawood"               "Lebanon"              
##  [745] "LEE"                   "Lee's Summit"          "Leesburg"             
##  [748] "LEESBURG"              "Lehi"                  "LEHI"                 
##  [751] "Lehighton"             "Lemont"                "Lenexa"               
##  [754] "Lenox"                 "Lewiston"              "Lexington"            
##  [757] "Lexington Park"        "LEXINGTON PARK"        "Liberty"              
##  [760] "LIBERTY LAKE"          "Lighthouse Point"      "Lilburn"              
##  [763] "Lima"                  "Lincoln"               "Lincoln City"         
##  [766] "Lincolnshire"          "Lindon"                "Linthicum, MD 21090"  
##  [769] "Lisle"                 "Lititz"                "LITTLE CHUTE"         
##  [772] "Little Rock"           "Littleton"             "Livermore"            
##  [775] "Livingston"            "Livonia"               "Lombard"              
##  [778] "London"                "Long Beach"            "Long Island City"     
##  [781] "Longmont"              "Longwood"              "Lorton"               
##  [784] "Los Alamitos"          "Los Angeles"           "Los Angeles, CA 90036"
##  [787] "Los Gatos"             "Louisiana"             "Louisville"           
##  [790] "LOUISVILLE"            "Louisville KY"         "Loveland"             
##  [793] "Lowell"                "Lubbock"               "Lumberton"            
##  [796] "Luray"                 "Lutz"                  "Lynnwood"             
##  [799] "Macon"                 "Madison"               "Madisonville"         
##  [802] "Mahwah"                "Malibu"                "Malvern"              
##  [805] "Manahawkin"            "Manalapan"             "manassas"             
##  [808] "Manassas"              "MANASSAS"              "Manchester"           
##  [811] "Mandeville"            "Manhattan"             "Manhattan Beach"      
##  [814] "Manitowoc"             "Mankato"               "Mansfield"            
##  [817] "Maple Grove"           "marietta"              "Marietta"             
##  [820] "Marina Del Rey"        "Marlborough"           "Marlton"              
##  [823] "Martinez"              "Mason"                 "Massapequa"           
##  [826] "Matthews"              "Maumee"                "Mayfield Heights"     
##  [829] "McAllen"               "McDonough"             "McFarland"            
##  [832] "McKinney"              "Mclean"                "McLean"               
##  [835] "McMurray"              "Meadow Vista"          "Mechanicsville"       
##  [838] "Medford"               "MEDFORD"               "Media"                
##  [841] "Medina"                "Melbourne"             "Melville"             
##  [844] "MELVILLE"              "memphis"               "Memphis"              
##  [847] "MENDOTA HEIGHTS"       "MENTOR"                "Meridian"             
##  [850] "Mesa"                  "MESA"                  "Metairie"             
##  [853] "Miami"                 "Miami Beach"           "MIAMI BEACH"          
##  [856] "Miami Lakes"           "Miamisburg"            "Michigan City"        
##  [859] "Middle River"          "Middletown"            "Midland"              
##  [862] "Midlothian"            "Midvale"               "Midwest City"         
##  [865] "Milford"               "Milledgeville"         "Millersville"         
##  [868] "Milpitas"              "Milton"                "Milwaukee"            
##  [871] "Milwaukie"             "Minneapolis"           "MINNEAPOLIS"          
##  [874] "Minnetonka"            "MIRAMAR"               "Mishawaka"            
##  [877] "Mission"               "Mission Viejo"         "Missoula"             
##  [880] "Mobile"                "Modesto"               "Mogadore"             
##  [883] "Mohnton"               "Mokena"                "Monroe"               
##  [886] "Monroeville"           "Monsey"                "Montgomery"           
##  [889] "Moorestown"            "Mooresville"           "Moorpark"             
##  [892] "Morgantown"            "Morristown"            "Morrisville"          
##  [895] "MORRISVILLE"           "Mount Dora"            "Mount Laurel"         
##  [898] "mount pleasant"        "Mount Pleasant"        "Mount Vernon"         
##  [901] "Mountain Top"          "Mountain View"         "Mountainside"         
##  [904] "Mt Laurel"             "Mt Pleasant"           "Mt Washington"        
##  [907] "Mt. Holly"             "Mt. Pleasant"          "Mt. Vernon"           
##  [910] "Murray"                "MURRELLS INLET"        "Murrieta"             
##  [913] "MURRIETA"              "N. Huntingdon"         "Nampa"                
##  [916] "Naperrville"           "Naperville"            "NAPERVILLE"           
##  [919] "Naples"                "Nappanee"              "Narberth"             
##  [922] "Nashua"                "Nashville"             "NASHVILLE"            
##  [925] "Natick"                "Navarre"               "Nazareth"             
##  [928] "Needham"               "Neenah"                "Nesconset"            
##  [931] "Nevada City"           "New Albany"            "New Bedford"          
##  [934] "New Braunfels"         "New Haven"             "New Melle"            
##  [937] "New Milford"           "New Orleans"           "NEW ORLEANS"          
##  [940] "New Port Richey"       "New Rochelle"          "NEW ULM"              
##  [943] "New York"              "New York city"         "New York City"        
##  [946] "Newark"                "Newburgh"              "Newburyport"          
##  [949] "Newhall"               "Newnan"                "Newport"              
##  [952] "NEWPORT"               "Newport Beach"         "Newport News"         
##  [955] "Newton"                "NEWTON CENTER"         "Newtown"              
##  [958] "Newtown Square"        "Nicholasville"         "NJ"                   
##  [961] "Noblesville"           "Nokomis"               "Norcross"             
##  [964] "NORCROSS"              "Norfolk"               "Norman"               
##  [967] "Norristown"            "North Andover"         "North Billerica"      
##  [970] "North Brunswick"       "North Charleston"      "North Hollywood"      
##  [973] "North Kansas City"     "North Las Vegas"       "North Liberty"        
##  [976] "North Miami Beach"     "North Myrtle Beach"    "North Riverside"      
##  [979] "North Salt Lake"       "North Smithfield"      "North Venice"         
##  [982] "Northampton"           "Northbrook"            "NORTHBROOK"           
##  [985] "Northport"             "norwalk"               "Norwalk"              
##  [988] "Norwood"               "Novato"                "Novelty"              
##  [991] "Novi"                  "Nutley"                "O'Fallon"             
##  [994] "Oak Brook"             "Oak Park"              "Oak Ridge"            
##  [997] "Oakbrook Terrace"      "Oakdale"               "Oakland"              
## [1000] "Oakland Park"          "Oakwood Village"       "Ocala"                
## [1003] "Ocean"                 "Ocean City"            "Oceanside"            
## [1006] "Odessa"                "Ofallon"               "Ogden"                
## [1009] "Oklahoma City"         "Olathe"                "Oldsmar"              
## [1012] "Olivette"              "Olympia"               "Omaha"                
## [1015] "OMAHA"                 "Ontario"               "Orange"               
## [1018] "Orange Park"           "Orem"                  "orlando"              
## [1021] "Orlando"               "Orrville"              "Oshkosh"              
## [1024] "Overland Park"         "OVERLAND PARK"         "Oviedo"               
## [1027] "Owings Mills"          "Owosso"                "OWOSSO"               
## [1030] "Oxford"                "Oxnard"                "Pacifica"             
## [1033] "Palatine"              "Palm Bay"              "Palm Desert"          
## [1036] "Palm Harbor"           "PALM HARBOR"           "Palmer"               
## [1039] "Palo Alto"             "Paramus"               "Park City"            
## [1042] "Park CIty"             "Parker"                "Parsippany"           
## [1045] "Parsippany-Troy Hills" "Pasadena"              "PASADENA"             
## [1048] "Paso Robles"           "Peabody"               "Peachtree Corners"    
## [1051] "Pearland"              "Pembroke Pines"        "PENN VALLEY"          
## [1054] "Pennsauken"            "Pensacola"             "PENSACOLA"            
## [1057] "Pensacola, FL"         "Perry Hall"            "Perrysburg"           
## [1060] "Petoskey"              "Petroleum"             "Pewaukee"             
## [1063] "Pflugerville"          "Philadelphia"          "Philadephia"          
## [1066] "Philipsburg"           "Phoenix"               "PHOENIX"              
## [1069] "Pikesville"            "Pine Brook"            "Piscataway"           
## [1072] "Pittsburgh"            "Pittsford"             "Plain City"           
## [1075] "Plainfield"            "Plainsboro"            "Plainview"            
## [1078] "Plano"                 "Plant City"            "Plantation"           
## [1081] "Pleasant Grove"        "Pleasanton"            "Pleasantville"        
## [1084] "Please Select"         "plymouth"              "Plymouth"             
## [1087] "Plymouth Meeting"      "Pompano Beach"         "Port Huron"           
## [1090] "Port Vincent"          "Port Washington"       "PORTAGE"              
## [1093] "Porter"                "Porter Ranch"          "Portersville"         
## [1096] "Portland"              "Portsmouth"            "PORTSMOUTH"           
## [1099] "Poseyville"            "Post Falls"            "POST FALLS"           
## [1102] "Potomac"               "Pottstown"             "Poway"                
## [1105] "Powell"                "Prattville"            "Princeton"            
## [1108] "Prospect"              "Provo"                 "Pueblo"               
## [1111] "Purcellville"          "puyallup"              "Quakertown"           
## [1114] "Queensbury"            "Quincy"                "Racine"               
## [1117] "Radnor"                "Raleigh"               "Ramsey"               
## [1120] "Rancho Cordova"        "Rancho Cucamonga"      "Rancho Santa Fe"      
## [1123] "Ranson"                "Rapid City"            "Reading"              
## [1126] "Red Bank"              "redmond"               "Redmond"              
## [1129] "Redondo Beach"         "Redwood City"          "Redwood Shores"       
## [1132] "Reno"                  "RENO"                  "Renson"               
## [1135] "Renton"                "Reston"                "Rhinebeck"            
## [1138] "Rhome"                 "Richardson"            "Richfield"            
## [1141] "Richland"              "richmond"              "Richmond"             
## [1144] "Ridgeland"             "River Falls"           "River Heights"        
## [1147] "Riverside"             "RIVERSIDE"             "Riverton"             
## [1150] "Roanoke"               "Rochelle Park"         "Rochester"            
## [1153] "ROCHESTER"             "Rock Hill"             "Rock Island"          
## [1156] "Rockford"              "Rockland"              "Rocklin"              
## [1159] "Rockville"             "ROCKVILLE"             "Rockwall"             
## [1162] "rocky river"           "Rocky River"           "Rolling Meadows"      
## [1165] "RONKONKOMA"            "Roseburg"              "Roselle"              
## [1168] "ROSEMEAD"              "Rosemont"              "Roseville"            
## [1171] "Rosharon"              "Rosslyn"               "Roswell"              
## [1174] "Round Rock"            "Roy"                   "Royal Oak"            
## [1177] "ROYAL OAK"             "Ruston"                "Rutherfordton"        
## [1180] "Sacramento"            "Safety Harbor"         "SAFETY HARBOR"        
## [1183] "Saint Augustine"       "Saint Charles"         "Saint George"         
## [1186] "Saint Louis"           "Saint Louis Park"      "Saint Paul"           
## [1189] "Saint Peters"          "Saint Petersburg"      "SAINT PETERSBURG"     
## [1192] "Salem"                 "Saline"                "Salisbury"            
## [1195] "Salt Lake City"        "san antonio"           "San Antonio"          
## [1198] "SAN ANTONIO"           "San Carlos"            "SAN CLEMENTE"         
## [1201] "San Diego"             "San Francisco"         "SAN GABRIEL"          
## [1204] "San Jose"              "SAN JOSE"              "San Juan"             
## [1207] "San Juan Capistrano"   "San Luis Obispo"       "San Marcos"           
## [1210] "San Marino"            "San Mateo"             "San Rafael"           
## [1213] "San Ramon"             "Sandusky"              "Sandy"                
## [1216] "Sandy Springs,"        "sanford"               "Sanford"              
## [1219] "SANFORD"               "Santa Ana"             "SANTA ANA"            
## [1222] "Santa Barbara"         "Santa Clara"           "Santa Clarita"        
## [1225] "Santa Cruz"            "SANTA CRUZ"            "Santa Fe Springs"     
## [1228] "Santa Monica"          "Santa Rosa"            "Santa Rosa Beach"     
## [1231] "Santee"                "Saranac"               "Sarasota"             
## [1234] "SARASOTA"              "Saratoga Springs"      "Sausalito"            
## [1237] "Savannah"              "Schaumberg"            "Schenectady"          
## [1240] "Scott AFB"             "scottsdale"            "Scottsdale"           
## [1243] "Seattle"               "Sedona"                "Seven Hills"          
## [1246] "Severna Park"          "Sevierville"           "Sewickley"            
## [1249] "SF"                    "Shaker Heights"        "Sharon"               
## [1252] "SHAWNEE"               "Sheridan"              "Sherman Oaks"         
## [1255] "Shoreview"             "Silver Spring"         "Silverado"            
## [1258] "Simi Valley"           "Simpsonville"          "Sinking Spring"       
## [1261] "Sioux City"            "Sioux Falls"           "Skillman"             
## [1264] "skokie"                "Skokie"                "Smithtown"            
## [1267] "Smock"                 "Solana Beach"          "Solana Bech"          
## [1270] "Solon"                 "Somerset"              "Somerville"           
## [1273] "Sonoma"                "South Bend"            "South Coast Metro"    
## [1276] "South El Monte"        "South Hackensack"      "South Holland"        
## [1279] "SOUTH HOLLAND"         "South Jordan"          "South Londonderry"    
## [1282] "South Miami"           "South Plainfield"      "South River"          
## [1285] "South Salt Lake"       "South San Francisco"   "Southampton"          
## [1288] "Southborough"          "Southfield"            "Southgate"            
## [1291] "Southlake"             "Spanish Fork"          "Spanish Fort"         
## [1294] "Spartanburg"           "Spicewood"             "Spokane"              
## [1297] "Spokane Valley"        "Spring"                "Springfield"          
## [1300] "St Louis"              "ST LOUIS"              "St Louis Park"        
## [1303] "St Paul"               "St Peters"             "St Petersburg"        
## [1306] "ST PETERSBURG"         "St. Augustine"         "St. Charles"          
## [1309] "St. George"            "St. James"             "St. Louis"            
## [1312] "St. Louis MO"          "St. Louis Park"        "St. Paul"             
## [1315] "St. Petersburg"        "St. Rose"              "Stafford"             
## [1318] "Stamford"              "STANTON"               "State College"        
## [1321] "Statesville"           "Steamboat Springs"     "Sterling"             
## [1324] "STEVENSON RANCH"       "Stevensville"          "Stillwater"           
## [1327] "Stoneham"              "Stow"                  "Stratford"            
## [1330] "Strongsville"          "Stuart"                "Studio City"          
## [1333] "sturgis"               "Suffern"               "Sugar Land"           
## [1336] "Sugarland"             "Suisun City"           "Sulphur"              
## [1339] "Summit"                "Sumner"                "Sun Prairie"          
## [1342] "Sun Valley"            "Sunnyvale"             "Suwanee"              
## [1345] "Swannnanoa"            "Swanton"               "Sykesville"           
## [1348] "Syracuse"              "Tabor CIty"            "Tacoma"               
## [1351] "TACOMA"                "Tallahassee"           "Tampa"                
## [1354] "Taylor"                "Taylorsville"          "Taylorville"          
## [1357] "Tea"                   "Teaneck"               "Temecula"             
## [1360] "tempe"                 "Tempe"                 "Temple"               
## [1363] "Temple Terrace"        "Terre Haute"           "The Villages"         
## [1366] "The Woodlands"         "Thomasville"           "Thornton"             
## [1369] "Thousand Oaks"         "Thousand Palms"        "Tigard"               
## [1372] "TIGARD"                "Tinley Park"           "Tinton Falls"         
## [1375] "Toledo"                "TOLEDO"                "tomball"              
## [1378] "Tomball"               "TOMBALL"               "Toms River"           
## [1381] "Topanga"               "Topeka"                "Topsfield"            
## [1384] "Torrance"              "Torrington"            "Towson"               
## [1387] "TOWSON"                "Travelers Rest"        "Trenton"              
## [1390] "Trevose"               "troy"                  "Troy"                 
## [1393] "TROY"                  "Truckee"               "Tucker"               
## [1396] "tucson"                "Tucson"                "tulsa"                
## [1399] "Tulsa"                 "Tumwater"              "Turlock"              
## [1402] "Turnersville"          "Tuscaloosa"            "Tustin"               
## [1405] "Twin Falls"            "Tyler"                 "TYLER"                
## [1408] "Tyrone"                "Tysons"                "Tysons Corner"        
## [1411] "Union"                 "Uniontown"             "Upper Nyack"          
## [1414] "Urbana"                "Urbandale"             "Valencia"             
## [1417] "Valhalla"              "valley cottage"        "Valley Stream"        
## [1420] "Valley View"           "Van Nuys"              "Vancouver"            
## [1423] "VANCOUVER"             "Venice"                "Venice Beach"         
## [1426] "Vernon"                "Vernon Hills"          "VERNON HILLS"         
## [1429] "Vero Beach"            "Victoria"              "Vidalia"              
## [1432] "Vienna"                "VIenna"                "Vineland"             
## [1435] "Virginia Beach"        "Visalia"               "Vista"                
## [1438] "Voorhees Township"     "Waco"                  "Waconia"              
## [1441] "wagoner"               "Wakefield"             "wall"                 
## [1444] "Wall"                  "Waller"                "Wallingford"          
## [1447] "Walnut"                "Walnut Creek"          "Waltham"              
## [1450] "Warner Robins"         "Warren"                "Warrendale"           
## [1453] "Warwick"               "Wash"                  "Washington"           
## [1456] "Washington D.C."       "Washington, D.C."      "Washington, DC"       
## [1459] "waterbury"             "Watertown"             "Waterville"           
## [1462] "Wauconda"              "Waukee"                "Waukesha"             
## [1465] "Waunakee"              "Wausau"                "Waxhaw"               
## [1468] "Wayne"                 "Weatherford"           "Webster"              
## [1471] "Wenatchee"             "West Babylon"          "West Bend"            
## [1474] "West Bloomfield"       "West Chester"          "WEST CHESTER"         
## [1477] "West Columbia"         "West Des Moines"       "West Fargo"           
## [1480] "West Hartford"         "West Haven"            "West Henrietta"       
## [1483] "West Hollywood"        "West Jordan"           "WEST LINN"            
## [1486] "West Long Branch"      "West Memphis"          "West Newbury"         
## [1489] "West Palm Beach"       "West Point"            "West Springfield"     
## [1492] "Westborough"           "Westerville"           "Westford"             
## [1495] "Westlake"              "WESTLAKE"              "Westlake Village"     
## [1498] "WESTLAKE VILLAGE"      "Westland"              "Westminster"          
## [1501] "Weston"                "Westport"              "Westville"            
## [1504] "Westwood"              "Wexford"               "Wharton"              
## [1507] "Wheat Ridge"           "White Plains"          "Whitinsville"         
## [1510] "Wichita"               "Williamston"           "Willoughby"           
## [1513] "Willow Park"           "Wilmington"            "Wilmington, MA"       
## [1516] "Wilminton"             "Wilsonville"           "Wimberley"            
## [1519] "Winchester"            "winder"                "Windham"              
## [1522] "Windham, NH"           "Windsor"               "Windsor Mill"         
## [1525] "Winooski"              "Winston Salem"         "Winter Garden"        
## [1528] "WINTER GARDEN"         "Winter Haven"          "Winter Park"          
## [1531] "WINTER PARK"           "Woburn"                "Woodbine"             
## [1534] "Woodbridge"            "Woodbury"              "Woodcliff Lake"       
## [1537] "Woodinville"           "WOODINVILLE"           "Woodland Hills"       
## [1540] "WOODLAND HILLS"        "Woods Cross"           "Woodstock"            
## [1543] "Worcester"             "Worthington"           "Wrightsville Beach"   
## [1546] "Wylie"                 "Wyncote"               "Wyoming"              
## [1549] "Wyomissing"            "Yakima"                "Yardley"              
## [1552] "Yonkers"               "Yorba Linda"           "York"                 
## [1555] "YORKTOWN"              "Zelienople"            "Zephyrhills"          
## [1558] "Zionsville"

The city variable contains values that are only seemingly unique: some cities are spelled in several different ways (e.g. because of typos or inconsistent capitalization).
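One way to surface such variants without reading the whole listing is to group the distinct values by a normalized key. The snippet below is a minimal sketch and not the approach used in this analysis: the vector `city_sample` is a hypothetical stand-in for `inc$city`, filled with a few values from the listing above, and the check catches only differences in letter case.

```r
# Hypothetical stand-in for inc$city, using a few values from the listing above
city_sample <- c("Vernon Hills", "VERNON HILLS", "Vienna", "VIenna",
                 "wall", "Wall", "Waco")

# Distinct spellings, grouped by their lower-cased form
spellings <- unique(city_sample)
by_key <- split(spellings, tolower(spellings))

# Keep only the names that occur in more than one spelling
case_variants <- by_key[lengths(by_key) > 1]
print(case_variants)
```

Typos such as a missing letter do not share a lower-cased key, so they need a separate check.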

ChatGPT can be used to detect such cases and to generate code that corrects them. In addition, we convert the values to title case, so that each word starts with an uppercase letter followed by lowercase letters:

# Title case: each word starts with an uppercase letter followed by lowercase letters
inc$city <- str_to_title(inc$city)
# Drop everything after a comma (e.g. a trailing state abbreviation)
inc$city <- sub(',.*', '', inc$city)
# Remove a stray leading colon
inc$city <- sub(':Livermore', 'Livermore', inc$city)
# Correct typos and unify abbreviations; fixed = TRUE treats "." as a literal dot
inc$city <- sub("Ahaheim", "Anaheim", inc$city)
inc$city <- sub("Birmimgham", "Birmingham", inc$city)
inc$city <- sub("Covingtom", "Covington", inc$city)
inc$city <- sub("Ft Worth", "Fort Worth", inc$city)
inc$city <- sub("Ft. Lauderdale", "Fort Lauderdale", inc$city, fixed = TRUE)
inc$city <- sub("Encinitass", "Encinitas", inc$city)
inc$city <- sub("Colorad Springs", "Colorado Springs", inc$city)
inc$city <- sub("Chicao", "Chicago", inc$city)
inc$city <- sub("Naperrville", "Naperville", inc$city)
inc$city <- sub("Wilminton", "Wilmington", inc$city)
inc$city <- sub("St. Louis Mo", "St. Louis", inc$city, fixed = TRUE)
inc$city <- sub("St Louis", "St. Louis", inc$city)
inc$city <- sub("St Petersburg", "St. Petersburg", inc$city)
inc$city <- sub("San Diago", "San Diego", inc$city)
inc$city <- sub("San José", "San Jose", inc$city)
inc$city <- sub("St Paul", "St. Paul", inc$city)
# Folds Santa Rosa Beach into Santa Rosa; check against state that both refer to the same place
inc$city <- sub("Santa Rosa Beach", "Santa Rosa", inc$city)
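Typos like the ones corrected above can also be shortlisted programmatically by comparing names pairwise with an edit distance. The following is a rough sketch under stated assumptions, not the method used in this report: `city_sample` is a hypothetical vector holding a few of the misspellings handled above, and the threshold of one edit is an arbitrary choice. It relies on `adist()` from the base `utils` package.

```r
# Hypothetical sample: two misspellings handled above plus an unrelated name
city_sample <- c("Wilmington", "Wilminton", "Chicago", "Chicao", "Waco")

# Pairwise generalized Levenshtein distances between all names
d <- adist(city_sample)

# Pairs that differ by exactly one edit; upper.tri() keeps each pair once
close <- which(d == 1 & upper.tri(d), arr.ind = TRUE)
candidates <- data.frame(a = city_sample[close[, "row"]],
                         b = city_sample[close[, "col"]],
                         stringsAsFactors = FALSE)
print(candidates)
```

The result is only a list of candidates: two genuinely different towns can also be one edit apart, so each pair still needs a manual check before it is turned into a `sub()` call.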

Re-examining the data structure

Next, we reorder the columns of the output data frame and check its structure once more.

inc <- select(inc, rank, name, industry,
              revenue, three_years_growth_percent, workers,
              founded, yrs_on_list,
              state, metro, city, url)
str(inc)
## 'data.frame':    5012 obs. of  12 variables:
##  $ rank                      : Factor w/ 4999 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ name                      : chr  "Freestar" "FreightWise" "Cece's Veggie Co." "LadyBoss" ...
##  $ industry                  : Factor w/ 27 levels "Advertising & Marketing",..: 1 19 11 5 23 13 5 26 13 1 ...
##  $ revenue                   : num  36.9 33.6 24.9 32.4 22.5 ...
##  $ three_years_growth_percent: num  36.7 30.5 23.9 21.9 18.2 ...
##  $ workers                   : int  40 39 190 57 25 742 12 72 60 37 ...
##  $ founded                   : num  2015 2015 2015 2014 2014 ...
##  $ yrs_on_list               : Factor w/ 14 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ state                     : Factor w/ 51 levels "AL","AR","AZ",..: 3 43 44 32 38 9 31 46 35 4 ...
##  $ metro                     : chr  "Phoenix" "Nashville" "Austin" NA ...
##  $ city                      : chr  "Phoenix" "Brentwood" "Austin" "Albuquerque" ...
##  $ url                       : chr  "http://freestar.com" "http://freightwisellc.com" "http://cecesveggieco.com" "http://ladyboss.com" ...

In the final step, the corrected file is saved to disk for further analysis:

write.csv(inc, file = "inc_corrected.csv",
          row.names = FALSE, fileEncoding = "UTF-8")
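When the saved file is loaded again, it is worth remembering that CSV does not store column types, so factor columns such as rank, industry, yrs_on_list and state have to be converted again. The snippet below is a small sketch of that round trip: `demo` is a hypothetical two-row stand-in for `inc`, and the file is written to a temporary path rather than to inc_corrected.csv.

```r
# Hypothetical two-row stand-in for the corrected data frame
demo <- data.frame(rank  = factor(c(1, 4)),
                   name  = c("Freestar", "LadyBoss"),
                   state = factor(c("AZ", "NM")),
                   metro = c("Phoenix", NA),
                   stringsAsFactors = FALSE)

# Write with the same settings as above, but to a temporary file
path <- tempfile(fileext = ".csv")
write.csv(demo, file = path, row.names = FALSE, fileEncoding = "UTF-8")

# Read it back; missing values are restored through na.strings
check <- read.csv(path, na.strings = c("", "NA"),
                  fileEncoding = "UTF-8", stringsAsFactors = FALSE)

# Factors come back as plain columns and must be converted again
check$rank  <- factor(check$rank)
check$state <- factor(check$state)
str(check)
```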